Asbestos? Predicting Houses before 1980

Course DS 250

Author

Tyler Binning

Setup

from types import GeneratorType
import pandas as pd
import altair as alt
import numpy as np
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

class Machine:
    def __init__(self):
        self.denver = pd.read_csv('https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_denver/dwellings_denver.csv')
        self.ml = pd.read_csv('https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_ml/dwellings_ml.csv')
        self.denver_samp = self.denver.sample(n= 4999)

    def ml_model(self,feature_lst, classifier):
        ml = self.ml
        x = ml.filter(feature_lst)
        y = ml.before1980
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = .25, random_state = 2003)
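        # the model itself is created by the caller and passed in as `classifier`

        # train the model
        classifier.fit(x_train, y_train)

        # make predictions
        y_predictions = classifier.predict(x_test)

        # pair each feature with its importance; this needs a classifier that
        # exposes feature_importances_ (tree-based models do, GaussianNB does not)
        feature_names = x.columns
        importances = classifier.feature_importances_
        importances_df = pd.DataFrame({'Features': feature_names, 'Importances': importances})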
        chart = alt.Chart(importances_df).mark_bar().encode(
            x= alt.X('Features'),
            y= alt.Y('Importances')
        ).properties( title='Features and their Importance')

        
        print(metrics.classification_report(y_test,y_predictions))
        return chart
    def explore(self,test):
        df = self.denver_samp
        chart = alt.Chart(df).mark_boxplot().encode(
            x= alt.X(test),
            y= alt.Y('yrbuilt', scale=alt.Scale(domain=(1850,2050)))
        )
        return chart
    def opp(self,test):
        df = self.denver_samp
        chart = alt.Chart(df).mark_circle().encode(
            x= alt.X('yrbuilt', scale=alt.Scale(domain=(1850,2050))),
            y= alt.Y(test)
        )
        return chart

model = Machine()

To set up this project, I first imported all the packages I would need. Then, to make experimenting quicker, I created a class whose methods take a list of features or a variable name and return the metrics along with a chart showing how much each feature contributes to the predictions.

Relationships!

Stories

model.explore('stories').properties(
    title='Stories'
)

Using the stories chart, we can see that one-story houses were mostly built before 1980 and that three- and four-story houses were largely built after 1980.
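The pattern read off the box plot can also be checked numerically. The sketch below is only an illustration: the rows in `toy` are made up, and only the column names `stories` and `yrbuilt` come from the dataset used above.

```python
import pandas as pd

# Hypothetical rows for illustration only; the real values live in dwellings_denver.csv.
toy = pd.DataFrame({
    'stories': [1, 1, 1, 2, 2, 3],
    'yrbuilt': [1955, 1962, 1995, 1978, 2004, 2010],
})

# A boolean column averages to the share of True values within each group,
# so this gives the fraction of houses built before 1980 per story count.
share_before_1980 = (
    toy.assign(before1980=toy['yrbuilt'] < 1980)
       .groupby('stories')['before1980']
       .mean()
)
print(share_before_1980)
```

Running the same grouping on `model.denver_samp` instead of `toy` should give the real shares, assuming the sampled data has no missing values in those two columns.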

Architecture Style

model.explore('arcstyle').properties(
    width=500, height=400, title='Architecture Style'
)

Looking at the box plots, we get a better sense of when the different styles were built. This is extremely helpful for predicting whether a house was built before or after 1980, because many of the styles are densely packed on one side of the 1980 line, with a few exceptions.

Predictive Model

As stated above, I collected the whole process into a method within a class that I created. I have found this to be a clean, organized approach that lets me plug and play with different features or classifiers. Here's what happens when I call my method.

features = (["numbaths",'stories','livearea', 'gartype_None', 'quality_A', 'quality_C', 'quality_D',
       'quality_X', 'status_I', 'condition_Good', 'sprice','arcstyle_ONE-STORY', 
    'arcstyle_CONVERSIONS', 'arcstyle_ONE AND HALF-STORY',
     'gartype_att/CP', 'gartype_det/CP','condition_Excel', 'condition_Fair','condition_AVG',
     'arcstyle_BI-LEVEL', 'arcstyle_CONVERSIONS', 'arcstyle_ONE AND HALF-STORY',
       'arcstyle_ONE-STORY',
       'arcstyle_TRI-LEVEL', 'arcstyle_TRI-LEVEL WITH BASEMENT', 'arcstyle_TWO-STORY','totunits','finbsmnt'])
model.ml_model(features ,RandomForestClassifier())
              precision    recall  f1-score   support

           0       0.88      0.88      0.88      2158
           1       0.93      0.92      0.93      3571

    accuracy                           0.91      5729
   macro avg       0.90      0.90      0.90      5729
weighted avg       0.91      0.91      0.91      5729

Running this method prints the metrics for this specific model and returns a chart showing which features are the most important. That lets me compare the metrics and charts before and after a change to decide which features to include or exclude when experimenting. As you can see, the accuracy in the report above is 0.91, slightly above 90%.
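The same plug-and-play idea can be extended to compare several classifiers on one split. This is a minimal sketch, not the method used above: it runs on synthetic data from `make_classification` so that it works without downloading anything, and the `candidates` dictionary and its labels are assumptions made for illustration.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the dwellings data (assumption: not the real features).
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=2003)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=2003)

# Every candidate sees exactly the same train/test split, so the scores are comparable.
candidates = {
    'decision tree': DecisionTreeClassifier(random_state=0),
    'random forest': RandomForestClassifier(random_state=0),
    'gradient boosting': GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, clf in candidates.items():
    clf.fit(x_train, y_train)
    scores[name] = metrics.accuracy_score(y_test, clf.predict(x_test))

# Print the candidates from most to least accurate.
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {acc:.3f}')
```

Fixing `random_state` on both the split and the classifiers keeps the comparison repeatable from one run to the next.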

Features

As the feature-importance graph shows, the top three features are the “living area”, the “selling price”, and whether or not the house is a “one story house.” I personally feel that houses have been getting larger over time, so it makes sense that the living area of pre-1980 houses would fall within a certain range. Houses of a similar age also seem to sell for about the same prices. And as we saw in the exploratory analysis above, one-story houses were mostly built before 1980.
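Reading the top features off a bar chart is easier when the importances are sorted. The sketch below shows one way to rank them with pandas. It uses synthetic data, and the `feature_0` to `feature_4` names are placeholders, not columns from the dwellings dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names (assumption: illustration only).
X, y = make_classification(n_samples=300, n_features=5, n_informative=3, random_state=0)
names = [f'feature_{i}' for i in range(X.shape[1])]

clf = RandomForestClassifier(random_state=0).fit(X, y)

# Rank the features from most to least important.
ranked = (
    pd.DataFrame({'Features': names, 'Importances': clf.feature_importances_})
      .sort_values('Importances', ascending=False)
      .reset_index(drop=True)
)
print(ranked.head(3))
```

In an Altair chart, a similar ordering can be requested with `alt.X('Features', sort='-y')`, which sorts the bars by descending height.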

Metrics

The classification report above contains several decimal numbers, which represent proportions (0.93 means 93%). We are going to focus on precision and recall.

Precision

Precision measures the model's ability to identify only the relevant data points: of all the houses the model labeled as built before 1980, how many really were. The precision for class 1 is 0.93, showing us that when the model predicted a house was built before 1980, it was correct 93% of the time.

Recall

Recall is the model's ability to find all the relevant cases in the data: how many true cases were identified out of all the true cases that should have been identified. The recall for class 1 is 0.92, showing us that the model found 92% of the houses that were actually built before 1980.
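To make the two definitions concrete, here is a small worked example that computes precision and recall by hand from a confusion matrix. The labels are made up for illustration and are not taken from the model above.

```python
from sklearn import metrics

# Made-up labels for illustration: 1 = built before 1980, 0 = built 1980 or later.
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 1]

# confusion_matrix puts true labels on the rows and predicted labels on the
# columns, so for binary labels ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

# Precision: of everything predicted as 1, how much really was 1.
precision = tp / (tp + fp)
# Recall: of everything that really was 1, how much the model found.
recall = tp / (tp + fn)

print(f'precision = {precision:.2f}, recall = {recall:.2f}')
```

Here the model makes one false positive and two false negatives, so precision (4 of 5 predictions correct) comes out higher than recall (4 of 6 true cases found). The hand-computed values match what `metrics.precision_score` and `metrics.recall_score` return for the same labels.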

Summary

Throughout this document we looked at which features are important in determining whether or not a house was built before 1980. Along the way we applied basic machine learning principles: deciding which features to include, choosing a classifier, and measuring a model's performance with different metrics.